Introduction

In this lab, we will be demonstrating how to tune and fit random forest models and gradient-boosted tree models (or “boosted trees”). We’ll demonstrate each of these using two different data sets – a regression data set (the Wage data from Lab 6) and a classification data set with multiple levels of the outcome (the palmerpenguins data, see here).

Loading Packages

library(tidymodels)
library(ISLR)
library(ISLR2)
library(tidyverse)
library(glmnet)
library(modeldata)
library(ggthemes)
library(janitor)
library(naniar)
library(xgboost)
library(ranger)
library(vip)
library(corrplot)
library(palmerpenguins) # for the penguins data
tidymodels_prefer()

Wage Data

The Wage data was described fairly thoroughly in Lab 6 – it consists of wage information and some information on several other descriptive variables for about 3,000 males in the mid-Atlantic region of the US. It’s loaded by default when we load the ISLR or ISLR2 package(s), so we just convert it to a tibble and use clean_names() to apply the snake case naming convention to the column names.

wage <- as_tibble(Wage) %>% 
  clean_names()

You can revisit some of the EDA we did with this data in Lab 6. Here, we’ll simply proceed to performing an initial split of the data into training and testing sets, then apply 5-fold cross-validation to the training set. Note that we make sure to use strata = "wage" for both of these functions, stratifying on the outcome variable.

set.seed(3435)
wage_split <- initial_split(wage, strata = "wage")

wage_train <- training(wage_split)
wage_test <- testing(wage_split)

wage_folds <- vfold_cv(wage_train, v = 5, strata = "wage")

Penguin Data

Artwork by @allison_horst

One of the classic commonly-used data sets for multiclass classification problems is iris. However, for several good reasons, there has been discussion in the data science community about whether continued use of iris is desirable or ethical. (For those interested, I recommend this blog post: It’s time to retire the iris data set.)

The palmerpenguins data set was introduced as an alternative to iris by Allison Horst (who wrote the R package). It is very similar in structure, in that it contains observations on 344 penguins from three different species (as visualized above) – the Chinstrap, Gentoo, and Adelie penguins – who live on three islands in the Palmer Archipelago, Antarctica. The data set contains information about the penguins’ bill length and depth in millimeters, their flipper length in millimeters, their body mass in grams, and their sex (male or female).

We will work with the palmerpenguins data set. We read it in here:

penguins <- as_tibble(palmerpenguins::penguins)

We can visualize any missingness:

vis_miss(penguins)

Here, it will probably just be easier to handle missingness by dropping the observations that have any values missing:

penguins <- penguins %>% 
  drop_na()

In doing so, we only lose about 11 observations. That’s because about \(3.2\%\) of sex observations are missing, and \(0.032(344) \approx 11\). It looks like one or two other penguins are missing some other values, but those are also missing their value for sex, so there are only 11 observations dropped total.

And we’ll do an initial split and 5-fold cross-validation, again stratifying on the outcome variable, which here is the species of penguin (or species):

set.seed(3435)
penguin_split <- initial_split(penguins, strata = "species")

penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

penguin_folds <- vfold_cv(penguin_train, v = 5, strata = "species")

Let’s do a little EDA. First of all, how does the outcome variable look – is it balanced (equal counts per class)?

This code also demonstrates how to make a custom color palette by specifying color hex codes and how to use that palette manually in a ggplot.

custom_pal <- c("#fb7504", "#c65ccc", "#047374")
penguin_train %>% 
  ggplot(aes(x = forcats::fct_infreq(species), fill = species,
             y = (..count..)/sum(..count..))) + 
  geom_bar() +
  scale_fill_manual(values = custom_pal)+
  xlab("Species") +
  ylab("Proportion of Training Set") +
  theme_minimal() +
  theme(legend.position = "none")

Chinstrap penguins are the least common, at only about \(20\%\) of the total, but even so, this doesn’t appear to be a huge imbalance, and we can likely get away without needing to upsample or downsample (although we could certainly try it if we chose).

penguin_train %>% 
  ggplot(aes(x = island, fill = species)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = custom_pal) +
  xlab("Island") +
  ylab("Count") +
  theme_minimal()

We definitely also might want to include the specific archipelago island as a predictor of penguin species, as shown by the above plot. Although Adelie penguins are found in approximately equal amounts on all three islands, Gentoo penguins were only found on Biscoe and Chinstrap penguins only on Dream.

We can also try visualizing the relationship between flipper length and body mass by species, as shown:

penguin_train %>% 
  ggplot(aes(x = flipper_length_mm, y = body_mass_g, 
             color = species)) +
  geom_point() +
  scale_color_manual(values = custom_pal) +
  theme_minimal() +
  xlab("Flipper Length (mm)") +
  ylab("Body Mass (g)") + 
  labs(color = "Species")

Activities

  • Describe the relationship(s) between body mass, flipper length, and species in the above picture. Do these predictors seem like they can help differentiate between species?

  • Make a correlation plot of the numeric variables in penguins. What relationships do you see?

Random Forest Models

For the wage data, we set up the same recipe as we did for a single pruned decision tree in Lab 6 – that is, using all predictors, but removing region (which is a constant) and logwage (which is a direct linear function of wage and therefore would make no sense to include). We dummy-code any nominal predictors and normalize all predictors:

rec_wage <- recipe(wage ~ ., data = wage_train) %>%
  step_rm(region, logwage) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_predictors())

For the penguin data, we remove year, which just represents the year in which the study was conducted/the penguins were recorded and likely doesn’t convey much information, and then do the same steps:

rec_penguin <- recipe(species ~ ., data = penguin_train) %>% 
  step_rm(year) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_predictors())

We then set up two separate random forest workflows, one for regression (rf_reg_wf) and one for classification (rf_class_wf), or one for the wage data and one for the penguins data, respectively. The default engine for random forest models in tidymodels is the ranger package. We flag three hyperparameters for tuning – mtry, trees, and min_n.

rf_reg_spec <- rand_forest(mtry = tune(), 
                           trees = tune(), 
                           min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("regression")

rf_reg_wf <- workflow() %>% 
  add_model(rf_reg_spec) %>% 
  add_recipe(rec_wage)

rf_class_spec <- rand_forest(mtry = tune(), 
                           trees = tune(), 
                           min_n = tune()) %>%
  set_engine("ranger") %>% 
  set_mode("classification")

rf_class_wf <- workflow() %>% 
  add_model(rf_class_spec) %>% 
  add_recipe(rec_penguin)

We set up a grid of hyperparameter values to consider. Here we allow mtry to range from 1 to 6, trees from 200 to 600, and min_n from 10 to 20, and we specify five levels of each:

rf_grid <- grid_regular(mtry(range = c(1, 6)), 
                        trees(range = c(200, 600)),
                        min_n(range = c(10, 20)),
                        levels = 5)
rf_grid
## # A tibble: 125 × 3
##     mtry trees min_n
##    <int> <int> <int>
##  1     1   200    10
##  2     2   200    10
##  3     3   200    10
##  4     4   200    10
##  5     6   200    10
##  6     1   300    10
##  7     2   300    10
##  8     3   300    10
##  9     4   300    10
## 10     6   300    10
## # … with 115 more rows

In this code chunk, we actually fit all the random forest models we’ve specified to each data set. Notice that we have set eval=FALSE for this code chunk and the following one; running the models takes a few minutes and is something we want to do as few times as possible, so we save the results to files and load them back in later.

tune_reg <- tune_grid(
  rf_reg_wf, 
  resamples = wage_folds, 
  grid = rf_grid
)
tune_class <- tune_grid(
  rf_class_wf,
  resamples = penguin_folds,
  grid = rf_grid
)

Here we saved the results to two files:

save(tune_reg, file = "tune_reg.rda")
save(tune_class, file = "tune_class.rda")

And finally, in this code chunk, we load the model results back in and take a look at each of them:

load("tune_reg.rda")
load("tune_class.rda")

autoplot(tune_reg) + theme_minimal()

autoplot(tune_class) + theme_minimal()

For the wage data, as we increase mtry, model performance seems to improve – RMSE tends to decrease and \(R^2\) tends to increase – until we reach about mtry of 3, at which point it starts to decrease again. The number of trees, trees, doesn’t make much of a difference overall, as those lines are virtually on top of each other (and that tends to be the case). A minimal node size, or min_n, of 20 (the plots on the far right) seems to produce slightly better results than a minimum size of 10 (plots on the far left).

We’ll select the optimal random forest model for each data set – in terms of RMSE for the wage data and in terms of ROC AUC for the penguins data:

show_best(tune_reg, n = 1)
## # A tibble: 1 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     4   600    20 rmse    standard    33.7     5    1.06 Preprocessor1_Model1…
best_rf_reg <- select_best(tune_reg)

show_best(tune_class, n = 1)
## # A tibble: 1 × 9
##    mtry trees min_n .metric .estimator  mean     n  std_err .config             
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>    <dbl> <chr>               
## 1     2   200    10 roc_auc hand_till   1.00     5 0.000152 Preprocessor1_Model…
best_rf_class <- select_best(tune_class)

Activities

  • How many random forest models are we fitting across folds for each data set (that is, how many are we fitting for the wage data? How many for the penguins data?)

  • Interpret the autoplot() results for the penguins data.

  • For both these data sets, which of the three hyperparameters we tuned seems to have the biggest effect on model performance overall?

Gradient-Boosted Trees

To fit boosted trees, we will use the same recipe as before, but now we will set up two separate boosted tree workflows, one for regression and one for classification. We’ll tune mtry and trees again; we could also choose to tune min_n again, but the learning rate tends to have a much bigger impact on the performance of gradient-boosted models, and we don’t want to increase the size of the grid too much, so we’ll add learn_rate instead of min_n.

The default engine for gradient-boosted trees is xgboost, which stands for “eXtreme Gradient Boosting.”

bt_reg_spec <- boost_tree(mtry = tune(), 
                           trees = tune(), 
                           learn_rate = tune()) %>%
  set_engine("xgboost") %>% 
  set_mode("regression")

bt_reg_wf <- workflow() %>% 
  add_model(bt_reg_spec) %>% 
  add_recipe(rec_wage)

bt_class_spec <- boost_tree(mtry = tune(), 
                           trees = tune(), 
                           learn_rate = tune()) %>%
  set_engine("xgboost") %>% 
  set_mode("classification")

bt_class_wf <- workflow() %>% 
  add_model(bt_class_spec) %>% 
  add_recipe(rec_penguin)

We set up a grid again, leaving the ranges of mtry and trees the same as before, but this time specifying the range for learn_rate. This is the default, which goes from \(-10\) to \(-1\) in log based ten (the default transformation; see ?learn_rate), or from \(10^{-10}=1 \times 10^{-10}\) to \(10^{-1}=0.1\).

bt_grid <- grid_regular(mtry(range = c(1, 6)), 
                        trees(range = c(200, 600)),
                        learn_rate(range = c(-10, -1)),
                        levels = 5)
bt_grid
## # A tibble: 125 × 3
##     mtry trees   learn_rate
##    <int> <int>        <dbl>
##  1     1   200 0.0000000001
##  2     2   200 0.0000000001
##  3     3   200 0.0000000001
##  4     4   200 0.0000000001
##  5     6   200 0.0000000001
##  6     1   300 0.0000000001
##  7     2   300 0.0000000001
##  8     3   300 0.0000000001
##  9     4   300 0.0000000001
## 10     6   300 0.0000000001
## # … with 115 more rows

We now tune all of these models for both data sets. Note that this code chunk and the following both specify eval=FALSE, meaning that they will not be run each time the .Rmd file is knitted – essentially because running them takes so long. Instead, the results are saved and read back in in a subsequent code chunk.

tune_bt_reg <- tune_grid(
  bt_reg_wf, 
  resamples = wage_folds, 
  grid = bt_grid
)
tune_bt_class <- tune_grid(
  bt_class_wf,
  resamples = penguin_folds,
  grid = bt_grid
)
save(tune_bt_reg, file = "tune_bt_reg.rda")
save(tune_bt_class, file = "tune_bt_class.rda")

We read the results in and take a look at the autoplot() for each set of models:

load("tune_bt_reg.rda")
load("tune_bt_class.rda")

autoplot(tune_bt_reg) + theme_minimal()

autoplot(tune_bt_class) + theme_minimal()

Then, just as before, we’ll select the optimal boosted tree model for each data set in terms of RMSE and ROC AUC, respectively:

show_best(tune_bt_reg, n = 1)
## # A tibble: 1 × 9
##    mtry trees learn_rate .metric .estimator  mean     n std_err .config         
##   <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1     1   200        0.1 rmse    standard    34.1     5    1.02 Preprocessor1_M…
best_bt_reg <- select_best(tune_bt_reg)

show_best(tune_bt_class, n = 1)
## # A tibble: 1 × 9
##    mtry trees learn_rate .metric .estimator  mean     n std_err .config         
##   <int> <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>           
## 1     2   300        0.1 roc_auc hand_till      1     5       0 Preprocessor1_M…
best_bt_class <- select_best(tune_bt_class)

Activities

  • Practice interpreting the results for each of these sets of models.

Model Selection

Let’s make a table to present the RMSE and ROC AUC values of our best-fitting models (across folds) for each of these two data sets:

Random Forest Boosted Tree
RMSE 33.74469 34.07487
Random Forest Boosted Tree
ROC AUC 0.9998485 1

We’ll go with the best random forest model for the wage data and the best boosted tree model for the penguin data. Here, we fit each of those to the entire training set(s) respectively.

Regression Model

Note that we specify importance = "impurity" in the workflow setup above; this is so that we can use extract_fit_parsnip() and vip() to create and view a variable importance plot.

final_rf_model <- finalize_workflow(rf_reg_wf, best_rf_reg)
final_rf_model <- fit(final_rf_model, wage_train)
final_rf_model %>% extract_fit_parsnip() %>% 
  vip() +
  theme_minimal()

This plot tells us that the three most useful predictors of wage in this random forest model are whether or not someone has an advanced degree, their age, and whether or not they have health insurance.

Let’s use the model to make predictions for the testing data and take a look at its testing RMSE:

final_rf_model_test <- augment(final_rf_model, wage_test)

rmse(final_rf_model_test, truth = wage, .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        34.4

We could also view a scatterplot of the actual wage values in the testing set versus the model-predicted values:

final_rf_model_test %>% 
  ggplot(aes(x = wage, y = .pred)) +
  geom_point(alpha = 0.5) +
  theme_minimal()

Classification Model

Now for our boosted tree model. We did not have to include importance = "impurity" in the workflow for this because the package xgboost, the default engine, stores variable importance information by default.

final_bt_model <- finalize_workflow(bt_class_wf, best_bt_class)
final_bt_model <- fit(final_bt_model, penguin_train)

Here is the VIP plot:

final_bt_model %>% extract_fit_parsnip() %>% 
  vip() +
  theme_minimal()

This tells us that the top three most important predictors of species were bill length and depth and flipper length.

Let’s use the model to make predictions for the testing data and take a look at its testing ROC AUC.

Notice that there is a key change here to accommodate the fact that there are more than two levels of our outcome variable. Rather than specifying .pred_Yes, for instance, we have to select all the columns of predicted probabilities – the probability that an observation is species Adelie, Chinstrap, and Gentoo. We can do that by typing .pred_Adelie:.pred_Gentoo, which selects not only the first and last columns but any columns in the middle.

final_bt_model_test <- augment(final_bt_model, 
                               penguin_test) %>% 
  select(species, starts_with(".pred"))

roc_auc(final_bt_model_test, truth = species, .pred_Adelie:.pred_Gentoo)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc hand_till          1

The ROC AUC on the testing data is basically 1; the model has done a basically perfect job classifying species. Let’s generate the three individual ROC curves:

roc_curve(final_bt_model_test, truth = species, .pred_Adelie:.pred_Gentoo) %>% 
  autoplot()

And the confusion matrix:

conf_mat(final_bt_model_test, truth = species, 
         .pred_class) %>% 
  autoplot(type = "heatmap")

The best-performing boosted tree model has literally done an almost perfect job. It has only misclassified one single penguin in the entire testing set – a penguin that is actually of species Adelie was predicted to be of species Chinstrap.

Activities

  • Interpret the scatterplot of predicted vs. actual values for the wage data. Does our model tend to overpredict or underpredict wage? Do you think excluding those observations with an extremely high wage might affect model performance? Do you think it might make sense to do so? Why or why not?

  • Do you think using a different threshold for probabilities would result in the penguin model’s correctly predicting every single observation? What threshold value(s) was used in creating the confusion matrix here? How do you know?

Resources

The free book Tidy Modeling with R is strongly recommended.